Abstract

In this paper, we study the global convergence of majorization minimization (MM)
algorithms for solving nonconvex regularized optimization problems. MM algorithms have
received great attention in machine learning. However, when applied to nonconvex
optimization problems, the convergence of MM algorithms is a challenging issue. We introduce
the theory of the Kurdyka-Lojasiewicz inequality to address this issue. In particular, we show
that many nonconvex problems enjoy the Kurdyka-Lojasiewicz property and establish the
global convergence result of the corresponding MM procedure. We also extend our result
to a well-known method called CCCP (concave-convex procedure).

Keywords: nonconvex optimization, majorization minimization, Kurdyka-Lojasiewicz
inequality, global convergence


1. Introduction 

Majorization minimization (MM) algorithms have wide applications in machine learning and
statistical inference (Lange et al., 2000, Lange, 2004). The MM algorithm can be regarded
as a generalization of expectation-maximization (EM) algorithms, and it aims to turn an
otherwise hard or complicated optimization problem into a tractable one by alternately
iterating a majorization step and a minimization step.

More specifically, the majorization step constructs a tractable surrogate function to
substitute the original objective function and the minimization step minimizes this surrogate
function to obtain a new estimate of the parameters in question. In the conventional MM
algorithm, convexity plays a key role in the construction of surrogate functions. Moreover,
convexity arguments give the conventional MM algorithm the same convergence
properties as EM algorithms (Lange, 2004).

We are instead interested in the use of MM algorithms for solving nonconvex (nonsmooth)
optimization problems. For example, nonconvex penalization has been demonstrated
to have attractive properties in sparse estimation. In particular, there exist many
nonconvex penalties, including the $\ell_q$ ($q \in (0,1)$) penalty, the smoothly clipped absolute
deviation (SCAD) (Fan and Li, 2001), the minimax concave plus penalty (MCP) (Zhang,
2010a), the capped-$\ell_1$ function (Zhang, 2010b, Zhang et al., 2012, Gong et al., 2013), the
LOG penalty (Mazumder et al., 2011, Armagan et al., 2013), etc. However, they might
yield computational challenges due to their nondifferentiability and nonconvexity.
An MM algorithm would be a desirable choice (Lange, 2004).

In this paper we would like to address the global convergence property of MM algorithms
for nonconvex optimization problems. Our motivation comes from the novel
Kurdyka-Lojasiewicz inequality. In the pioneering work (Lojasiewicz, 1963, Lojasiewicz, 1993), the
author provided the "Lojasiewicz inequality" to derive finite trajectories. Later on, Kurdyka
(1998) extended the Lojasiewicz inequality to definable functions and applications. Bolte
et al. (2007) then extended it to nonsmooth subanalytic functions. Recently, the
Kurdyka-Lojasiewicz property has been used to establish convergence analysis of proximal alternating
minimization or coordinate descent algorithms (Attouch et al., 2010, Xu and Yin, 2013,
Bolte et al., 2013).

We revisit a generic MM procedure of solving nonconvex optimization problems. We 
observe that many nonconvex penalty functions satisfy the Kurdyka-Lojasiewicz inequality 
and such a property is shared by a number of machine learning problems arising in a 
wide variety of applications. Specifically, we demonstrate several examples, which admit 
the Kurdyka-Lojasiewicz property. Thus, we conduct the convergence analysis of the MM 
procedure based on theory of the Kurdyka-Lojasiewicz inequality. More specifically, our 
work offers the following major contributions. 

• We discuss a family of nonconvex optimization problems in which the objective
function consists of a smooth function and a non-smooth function. We give constructive
criteria for surrogates that approximate the original functions well. Additionally,
we illustrate that many existing methods for solving the nonconvex optimization
problem can be regarded as an MM procedure.

• We establish the global convergence results of a generic MM framework for the
nonconvex problem, which are obtained by exploiting the geometrical property of the
objective function around its critical point. To the best of our knowledge, our work is
the first study to address the convergence property of MM algorithms for nonconvex
optimization using the Kurdyka-Lojasiewicz inequality.

• We also show that our global convergence results can be successfully extended to
many popular and powerful methods such as the iteratively re-weighted $\ell_1$ minimization
method (Candes et al., 2008, Chartrand and Yin, 2008), local linear approximation
(LLA) (Zou and Li, 2008, Zhang, 2010b), the concave-convex procedure (CCCP) (Yuille
and Rangarajan, 2003, Lanckriet and Sriperumbudur, 2009), etc.

1.1 Related Work and Organization

We discuss some related work on the convergence analysis of nonconvex optimization.
Vaida (2005) established the global convergence of EM algorithms and extended it to the
global convergence of MM algorithms under some conditions. However, that work considered
a differentiable objective function, whereas the objective function in our paper can be
nonsmooth (and also nonconvex). This implies that the problem we are considering is more
challenging. Additionally, Vaida (2005) assumed that all the stationary points of the objective
function are isolated. In our paper, we do not require this assumption. The isolation
assumption does not always hold, or holds but is difficult to verify, for many objective functions in
practice. This motivates us to employ the Kurdyka-Lojasiewicz inequality to establish the
convergence. Moreover, it is usually easy to verify that the objective function admits the
Kurdyka-Lojasiewicz inequality. Gong et al. (2013) proposed an efficient iterative shrinkage
and thresholding algorithm to solve nonconvex regularized problems. The key assumption
is that the proximal operator of the regularizer has a closed form. We note
that this method falls into our MM framework. However, the authors only showed that
a subsequence converges to a critical point. Mairal (2013) instead studied asymptotic
stationary point conditions with first-order surrogate functions, but did not establish that the
generated sequence converges to a solution point.

Attouch et al. (2010), Xu and Yin (2013), Bolte et al. (2013) employed the
Kurdyka-Lojasiewicz inequality to analyze the convergence of algorithms for nonconvex optimization problems.
They are mainly concerned with the convergence analysis of block coordinate
approaches. In this paper, we pay attention to the global convergence analysis of the MM
framework for solving nonconvex regularization problems. Specifically, we construct
surrogates for both the smooth and nonsmooth terms. To achieve the global convergence, we
exploit the geometric property of the objective function around its critical point.

The remainder of the paper is organized as follows. Section 2 provides preliminaries
on nonsmooth and nonconvex analysis and introduces the Kurdyka-Lojasiewicz
property. We also give some examples which enjoy the Kurdyka-Lojasiewicz inequality. In
Section 3, we formulate the problem we are interested in and make some common
assumptions. A generic majorization minimization algorithm is revisited in Section 4. Section 5 is
the key part of our paper, which gives the global convergence results. In Section 6 we extend
our work to CCCP. In Section 7 we conduct numerical examples to verify our theoretical
results. Finally, we conclude our work in Section 8.

2. Preliminaries 

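In this section we introduce the notions of the Frechet subdifferential and the limiting-subdifferential.
Then we present the novel Kurdyka-Lojasiewicz inequality. First of all, for any
$\mathbf{u} = (u_1, \ldots, u_p)^T \in \mathbb{R}^p$ and $\mathbf{v} = (v_1, \ldots, v_p)^T \in \mathbb{R}^p$, we denote
$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^{p} u_i v_i$ and $\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}$ here and later.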


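Definition 1 (Subdifferentials) (Rockafellar et al., 1998) Consider a proper and lower
semi-continuous function $f : \mathbb{R}^p \to (-\infty, +\infty]$ and a point $\mathbf{x} \in \mathrm{dom}(f)$.

(i) The Frechet subdifferential of $f$ at $\mathbf{x}$, denoted $\hat{\partial} f(\mathbf{x})$, is the set of all vectors
$\mathbf{u} \in \mathbb{R}^p$ which satisfy

\[ \liminf_{\mathbf{y} \neq \mathbf{x},\ \mathbf{y} \to \mathbf{x}} \frac{f(\mathbf{y}) - f(\mathbf{x}) - \mathbf{u}^T(\mathbf{y} - \mathbf{x})}{\|\mathbf{y} - \mathbf{x}\|} \geq 0 . \]

(ii) The limiting-subdifferential of $f$ at $\mathbf{x}$, denoted $\partial f(\mathbf{x})$, is defined as

\[ \partial f(\mathbf{x}) = \left\{ \mathbf{u} \in \mathbb{R}^p : \exists\, \mathbf{x}_k \to \mathbf{x},\ f(\mathbf{x}_k) \to f(\mathbf{x}) \text{ and } \mathbf{u}_k \in \hat{\partial} f(\mathbf{x}_k) \to \mathbf{u} \text{ as } k \to \infty \right\} . \]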


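Remark 2 Here $\mathrm{dom} f = \{\mathbf{x} : f(\mathbf{x}) < +\infty\}$. If $\mathbf{x} \notin \mathrm{dom} f$, one sets $\partial f(\mathbf{x}) = \emptyset$. It is
worth pointing out that $\hat{\partial} f(\mathbf{x})$ for each $\mathbf{x}$ is closed and convex while $\partial f(\mathbf{x})$ is closed. If
$f$ is differentiable at $\mathbf{x}_0$, then $\hat{\partial} f(\mathbf{x}_0) = \{\nabla f(\mathbf{x}_0)\}$ and $\nabla f(\mathbf{x}_0) \in \partial f(\mathbf{x}_0)$. For more details
we refer to Rockafellar et al. (1998). As we see, both the Frechet subdifferential and the
limiting-subdifferential are applicable to nonconvex functions.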

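Corollary 3 (Rockafellar and Wets, 1998) Suppose $F = f + r : \mathbb{R}^p \to \mathbb{R}$. Moreover,
$f$ is smooth in the neighborhood of $\mathbf{x}_0$ and $r$ is finite at $\mathbf{x}_0$. Then, we have

\[ \hat{\partial} F(\mathbf{x}_0) = \nabla f(\mathbf{x}_0) + \hat{\partial} r(\mathbf{x}_0) \quad \text{and} \quad \partial F(\mathbf{x}_0) = \nabla f(\mathbf{x}_0) + \partial r(\mathbf{x}_0) . \]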

Definition 4 It is said that $\mathbf{x}^* \in \mathbb{R}^p$ is a critical point of a lower semi-continuous function
$F : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ if the following condition holds:

\[ 0 \in \partial F(\mathbf{x}^*) . \]

Remark 5 If $\mathbf{x}^*$ is a minimizer (not necessarily global) of the function $F$, we can conclude that
$0 \in \partial F(\mathbf{x}^*)$. The set of critical points of $F$ is denoted by $\mathrm{crit} F$.


2.1 Kurdyka-Lojasiewicz properties 

With the notion of subdifferentials, we now briefly recall the Kurdyka-Lojasiewicz inequality,
which plays a central role in our global convergence analysis.

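Definition 6 Let the function $F : \mathbb{R}^p \to (-\infty, +\infty]$ be proper and lower semi-continuous.
Then $F$ is said to have the Kurdyka-Lojasiewicz property at $\bar{\mathbf{u}} \in \mathrm{dom}\, \partial F$ if there exist
$\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{\mathbf{u}}$, and a continuous concave function $\varphi : [0, \eta) \to \mathbb{R}_+$
with the following properties:

(a) $\varphi(0) = 0$,

(b) $\varphi$ is $C^1$ on $(0, \eta)$,

(c) for all $t \in (0, \eta)$, $\varphi'(t) > 0$,

such that for all $\mathbf{u}$ in $U \cap [F(\bar{\mathbf{u}}) < F(\mathbf{u}) < F(\bar{\mathbf{u}}) + \eta]$ the following Kurdyka-Lojasiewicz
inequality holds true:

\[ \varphi'\bigl(F(\mathbf{u}) - F(\bar{\mathbf{u}})\bigr)\, \mathrm{dist}\bigl(0, \partial F(\mathbf{u})\bigr) \geq 1 . \]

Here $\mathrm{dist}(\mathbf{u}, A) = \inf_{\mathbf{v}} \{\|\mathbf{u} - \mathbf{v}\| : \mathbf{v} \in A\}$ for a set $A \subset \mathbb{R}^p$.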
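
As a simple illustration of the Kurdyka-Lojasiewicz inequality (an example of our own choosing, not one of the penalties studied in this paper), consider $F(u) = u^2$ on $\mathbb{R}$ with $\bar{u} = 0$. Assuming the concave function $\varphi(s) = \sqrt{s}$, the inequality can be checked by hand:

```latex
\begin{align*}
F(u) - F(\bar{u}) &= u^2, \qquad \partial F(u) = \{2u\}, \qquad
\mathrm{dist}\bigl(0, \partial F(u)\bigr) = 2|u|, \\
\varphi'(s) &= \frac{1}{2\sqrt{s}}
\quad\Longrightarrow\quad
\varphi'\bigl(F(u) - F(\bar{u})\bigr)\,\mathrm{dist}\bigl(0, \partial F(u)\bigr)
  = \frac{1}{2|u|} \cdot 2|u| = 1 \geq 1
\quad \text{for all } u \neq 0 .
\end{align*}
```

Here $\varphi$ is continuous and concave on $[0, +\infty)$, vanishes at zero, and has a positive derivative on $(0, +\infty)$, so the requirements of the definition are met with $\eta = +\infty$.
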

It is well established that real analytic and sub-analytic functions satisfy the
Kurdyka-Lojasiewicz property (Bolte et al., 2007). Moreover, the sum of a real analytic function
and a subanalytic function is subanalytic (Bochnak et al., 1998). Thus, the sum admits
the Kurdyka-Lojasiewicz property. Many functions involved in machine learning satisfy the
Kurdyka-Lojasiewicz property. For example, both the logistic loss and the least squares
loss are real analytic.

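We also find that many nonconvex penalty functions, such as MCP, LOG, SCAD, and
capped-$\ell_1$, enjoy the Kurdyka-Lojasiewicz property. Here we give two examples. First, the
MCP function is defined as

\[ \zeta(t; \lambda, \gamma) = \begin{cases} \lambda \left( |t| - \dfrac{t^2}{2\lambda\gamma} \right) & \text{if } |t| < \lambda\gamma, \\[2mm] \dfrac{\lambda^2 \gamma}{2} & \text{if } |t| \geq \lambda\gamma, \end{cases} \]

where $\lambda, \gamma > 0$ are constants. The graph of $\zeta$ is the closure of the following set

\[ \left\{ (t, s) : s = \tfrac{\lambda^2 \gamma}{2},\ t < -\lambda\gamma \right\} \cup \left\{ (t, s) : s = \tfrac{\lambda^2 \gamma}{2},\ t > \lambda\gamma \right\} \cup \left\{ (t, s) : s = -\lambda t - \tfrac{t^2}{2\gamma},\ -\lambda\gamma < t < 0 \right\} \cup \left\{ (t, s) : s = \lambda t - \tfrac{t^2}{2\gamma},\ 0 < t < \lambda\gamma \right\} . \]

This implies that MCP is a semi-algebraic function (Bochnak et al., 1998), which is
subanalytic (Bolte et al., 2007). Thus, MCP satisfies the Kurdyka-Lojasiewicz property.
Similarly, we can obtain that the SCAD and capped-$\ell_1$ penalties have the Kurdyka-Lojasiewicz
property.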

Second, the LOG penalty is defined as

\[ \zeta(t; \lambda, \alpha) = \lambda \log(1 + \alpha |t|), \quad \text{for } \alpha > 0 . \]

The graph of $\zeta$ is the closure of the following set

\[ \left\{ (t, s) : s = \lambda \log(1 + \alpha t),\ t > 0 \right\} \cup \left\{ (t, s) : s = \lambda \log(1 - \alpha t),\ t < 0 \right\} . \]

Note that the graph is sub-analytic (Bolte et al., 2007), so the LOG penalty is sub-analytic
and hence enjoys the Kurdyka-Lojasiewicz property.


3. Problem and Assumptions 

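In this paper we are mainly concerned with the following optimization problem

\[ \min_{\mathbf{w} \in \mathbb{R}^p} \bigl\{ F(\mathbf{w}) = f(\mathbf{w}) + r(\mathbf{w}) \bigr\} . \tag{1} \]

Many machine learning problems can be cast into this formulation. Typically, $f(\mathbf{w})$ is
defined as a loss function and $r(\mathbf{w})$ is defined as a regularization (or penalization) term.
Specifically, given a training dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, one defines
$f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{w})$, where $f_i$ is the loss on the $i$th example. A common setting for the
penalty function is $r(\mathbf{w}) = \sum_{i=1}^{p} \zeta(|w_i|; \lambda)$, where $\lambda$ is the tuning parameter controlling the
trade-off between the loss function and the regularization.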

Recently, many nonconvex penalty functions, such as LOG (Mazumder et al., 2011,
Armagan et al., 2013), SCAD (Fan and Li, 2001), MCP (Zhang, 2010a), and the capped-$\ell_1$
function (Zhang, 2010b), have been proposed to model sparsity. These penalty functions
have been demonstrated to have attractive properties theoretically and practically.

Meanwhile, iteratively reweighted methods have been widely used to solve the optimization
problem in (1). Usually, the iteratively reweighted method can be cast as a majorization
minimization (MM) procedure. In this paper we attempt to conduct convergence analysis of the MM
procedure. For our purpose, we make some assumptions about the objective function.

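Assumption 7 Suppose $f : \mathbb{R}^p \to \mathbb{R}_+$ is a smooth function of the type $C^1$. Moreover, the
gradient of $f$ is $L_f$-Lipschitz continuous; that is,

\[ \|\nabla f(\mathbf{u}) - \nabla f(\mathbf{v})\| \leq L_f \|\mathbf{u} - \mathbf{v}\| \tag{2} \]

for any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^p$, where $L_f > 0$ is called a Lipschitz constant of $\nabla f$.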

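Corollary 8 Let $h(\mathbf{w}) = \sum_{i=1}^{n} \alpha_i f_i(\mathbf{w})$. Suppose $f_i(\mathbf{w})$ is differentiable for any
$i \in [n] = \{1, 2, \ldots, n\}$. If each $\nabla f_i(\mathbf{w})$ is $L_i$-Lipschitz continuous ($L_i > 0$), then $h(\mathbf{w})$ is differentiable
and $\nabla h(\mathbf{w})$ is $\sum_{i=1}^{n} |\alpha_i| L_i$-Lipschitz continuous.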

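Lemma 9 If $f : \mathbb{R}^p \to \mathbb{R}$ is differentiable and $\nabla f$ is $L_f$-Lipschitz continuous, then

\[ f(\mathbf{u}) \leq f(\mathbf{v}) + \langle \nabla f(\mathbf{v}), \mathbf{u} - \mathbf{v} \rangle + \frac{L_f}{2} \|\mathbf{u} - \mathbf{v}\|^2 \tag{3} \]

for any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^p$.

This is a classical result whose proof can be found in Nesterov (2004).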
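
The quadratic upper bound in Lemma 9 can be checked numerically on a simple case. The sketch below is only an illustration under our own assumptions: it takes the one-dimensional logistic loss $f(t) = \log(1 + e^{-t})$, whose second derivative is at most $1/4$ so that $L_f = 1/4$ is a valid Lipschitz constant, and evaluates the gap between the two sides of the bound on a grid. The function names and the grid are hypothetical choices.

```python
import math

def f(t):
    # One-dimensional logistic loss log(1 + exp(-t)), evaluated stably.
    return math.log1p(math.exp(-abs(t))) + max(-t, 0.0)

def grad_f(t):
    # Derivative of log(1 + exp(-t)), i.e. -1 / (1 + exp(t)), evaluated stably.
    if t >= 0:
        e = math.exp(-t)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(t))

# Assumption of this sketch: f'' = s(1 - s) <= 1/4 for the sigmoid s,
# so f' is Lipschitz continuous with constant 1/4.
L_F = 0.25

def quadratic_upper_bound(u, v):
    # Right-hand side of the descent inequality at the pair (u, v).
    return f(v) + grad_f(v) * (u - v) + 0.5 * L_F * (u - v) ** 2

grid = [i / 4.0 for i in range(-40, 41)]
# Smallest value of (upper bound - f(u)) over all grid pairs; it should not be negative.
worst_gap = min(quadratic_upper_bound(u, v) - f(u) for u in grid for v in grid)
```

The smallest gap is attained at $u = v$, where the bound is tight.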

Assumption 10 $F : \mathbb{R}^p \to \mathbb{R}$ is lower semi-continuous and coercive, and it satisfies
$\inf_{\mathbf{w} \in \mathbb{R}^p} F(\mathbf{w}) > -\infty$.


We give several examples to show that the assumptions hold in many machine learning
problems. For linear regression, $f(\mathbf{w}) = \frac{1}{2n} \|X\mathbf{w} - \mathbf{y}\|^2$, where $\mathbf{w} \in \mathbb{R}^p$,
$X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^T \in \mathbb{R}^{n \times p}$ is the input matrix and $\mathbf{y} = [y_1, \ldots, y_n]^T \in \mathbb{R}^n$ is the output vector.
In this example, the Lipschitz constant of $\nabla f(\mathbf{w})$ is lower-bounded by the maximum
eigenvalue of $\frac{1}{n} X^T X$. In binary classification problems in which $y_i \in \{-1, 1\}$, we consider the
logistic regression loss function. Specifically, $f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{x}_i^T \mathbf{w}))$. The

Lipschitz constant of V/(w) is lower-bounded by ^ 

4. Majorization Minimization Algorithms 

We consider a minimization problem with the objective function $F(\mathbf{w})$. Given an estimate
$\mathbf{w}^{(k)}$ at the $k$th iteration, a typical MM algorithm consists of the following two steps:

1. A function $g(\mathbf{u})$ on $\mathbb{R}^p$ is said to be coercive if $\lim_{\|\mathbf{u}\| \to \infty} g(\mathbf{u}) = \infty$.
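
Table 1: Examples of nonconvex penalties for one dimension

Function                 $\zeta(t)$

LOG                      $\frac{\lambda}{\log(\theta + 1)} \log(1 + \theta |t|)$, ($\theta > 0$)

SCAD                     $\lambda |t|$                                                          if $|t| \leq \lambda$,
                         $\frac{2\theta\lambda|t| - t^2 - \lambda^2}{2(\theta - 1)}$            if $\lambda < |t| \leq \theta\lambda$, ($\theta > 2$)
                         $\frac{(\theta + 1)\lambda^2}{2}$                                      if $\theta\lambda < |t|$

MCP                      $\lambda \left( |t| - \frac{t^2}{2\lambda\gamma} \right)$              if $|t| < \lambda\gamma$,
                         $\frac{\lambda^2 \gamma}{2}$                                           if $|t| \geq \lambda\gamma$

Capped $\ell_1$-penalty  $\lambda \min(|t|, \theta)$, ($\theta > 0$)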

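Majorization Step: Substitute $F(\mathbf{w})$ by a tractable surrogate function $Q(\mathbf{w}|\mathbf{w}^{(k)})$,
such that

\[ Q(\mathbf{w}|\mathbf{w}^{(k)}) \geq F(\mathbf{w}) \]

for any $\mathbf{w} \in \mathrm{dom} F$, with equality holding at $\mathbf{w} = \mathbf{w}^{(k)}$.

Minimization Step: Obtain the next parameter estimate $\mathbf{w}^{(k+1)}$ by minimizing
$Q(\mathbf{w}|\mathbf{w}^{(k)})$ with respect to $\mathbf{w}$. That is,

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} Q(\mathbf{w}|\mathbf{w}^{(k)}) . \]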
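
The two steps can be sketched on a toy one-dimensional problem. The following is a minimal illustration under our own assumptions, not the algorithm analyzed later: the objective is a logistic term plus a small ridge term (the weight `RIDGE` is a hypothetical choice), the logistic term is majorized by a quadratic with curvature $1/4$, and the surrogate minimizer is available in closed form.

```python
import math

RIDGE = 0.1   # hypothetical weight of the quadratic term, chosen for illustration
CURV = 0.25   # upper bound on the second derivative of the logistic term

def logistic(t):
    # log(1 + exp(-t)), evaluated stably.
    return math.log1p(math.exp(-abs(t))) + max(-t, 0.0)

def logistic_grad(t):
    # Derivative of log(1 + exp(-t)); t stays moderate in this toy example.
    return -1.0 / (1.0 + math.exp(t))

def objective(t):
    return logistic(t) + 0.5 * RIDGE * t * t

def surrogate_argmin(t_k):
    # Majorization step: Q(t | t_k) = logistic(t_k) + logistic_grad(t_k) * (t - t_k)
    #                                 + (CURV / 2) * (t - t_k)^2 + (RIDGE / 2) * t^2.
    # Minimization step: setting dQ/dt = 0 gives the closed form below.
    return (CURV * t_k - logistic_grad(t_k)) / (CURV + RIDGE)

def mm_minimize(t0, num_iters=200):
    t = t0
    values = [objective(t)]
    for _ in range(num_iters):
        t = surrogate_argmin(t)
        values.append(objective(t))
    return t, values

t_star, values = mm_minimize(t0=-3.0)
```

Because the surrogate lies above the objective and touches it at the current iterate, the recorded objective values cannot increase from one iteration to the next.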


In order to address the global convergence of MM for solving the problem in (1), we 
propose a generic MM framework under the assumptions given in the previous section. 
We particularly present criteria to devise the majorant functions of the loss function and 
penalty function, respectively. 

4.1 Majorization of Loss Function 

We first consider the majorization of the loss function $f(\mathbf{w})$. Recall that $\nabla f(\mathbf{w})$ is assumed
to be Lipschitz continuous (Assumption 7). Given the estimate $\mathbf{w}^{(k)}$ of $\mathbf{w}$ at the $k$th
iteration, one would derive the majorization of $f(\mathbf{w})$ to obtain $\mathbf{w}^{(k+1)}$. For the sake of
simplicity, we denote the corresponding surrogate function as $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$. In our work, we
require that $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$ have the following two properties so that the surrogate can
approximate the objective $f$ well and also lead to efficient computations.

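Assumption 11 Let $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$ be the majorization of $f(\mathbf{w})$ such that $Q_f(\mathbf{w}|\mathbf{w}^{(k)}) \geq f(\mathbf{w})$
and $Q_f(\mathbf{w}^{(k)}|\mathbf{w}^{(k)}) = f(\mathbf{w}^{(k)})$. Additionally, the following properties also hold:

(i) $Q_f(\mathbf{w}|\mathbf{w}^{(k)}) - f(\mathbf{w})$ is $\gamma$-strongly convex, where $\gamma > 0$;

(ii) $\nabla Q_f(\mathbf{w}|\mathbf{w}^{(k)})$ is Lipschitz continuous, and $\nabla Q_f(\mathbf{w}^{(k)}|\mathbf{w}^{(k)}) = \nabla f(\mathbf{w}^{(k)})$.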

Let us look at several popular existing algorithms which meet Assumption 11. Proximal
algorithms (Rockafellar, 1976, Lemaire, 1989, Iusem, 1999, Combettes and Pesquet, 2011,
Parikh and Boyd, 2013) solve optimization problems by using a so-called proximal operator
of the objective function. Suppose we have an objective function $f(\mathbf{w})$ at hand. Given the
$k$th estimate $\mathbf{w}^{(k)}$, the proximal algorithm aims to solve the following problem

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \left\{ f(\mathbf{w}) + \frac{1}{2\alpha_k} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 \right\}, \]

where $\alpha_k$ is the step size at each iteration. Typically, $\mathbf{w}^{(k+1)}$ is written as:

\[ \mathbf{w}^{(k+1)} = \mathrm{Prox}_{\alpha_k f}(\mathbf{w}^{(k)}) . \]

The majorization function is defined as $Q_f(\mathbf{w}|\mathbf{w}^{(k)}) = f(\mathbf{w}) + \frac{1}{2\alpha_k} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2$. When $f(\mathbf{w})$
is convex and $\nabla f(\mathbf{w})$ is Lipschitz continuous, it is easy to check that $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$ satisfies
Assumption 11.

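Another powerful algorithm is the proximal gradient algorithm. The algorithm is more
efficient when dealing with the following problem

\[ \mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \bigl\{ f(\mathbf{w}) + r(\mathbf{w}) \bigr\}, \tag{4} \]

where $f$ is differentiable and convex and $r$ is nonsmooth. The proximal gradient algorithm
first approximates $f(\mathbf{w})$ based on a local linear expansion plus a proximal term, both at
the current estimate $\mathbf{w}^{(k)}$. That is,

\[ f(\mathbf{w}) \approx f(\mathbf{w}^{(k)}) + \langle \nabla f(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{1}{2\alpha_k} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2, \]

where $\alpha_k > 0$ is the step size. Then the $(k+1)$th estimate of $\mathbf{w}$ is given as

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \left\{ f(\mathbf{w}^{(k)}) + \langle \nabla f(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{1}{2\alpha_k} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 + r(\mathbf{w}) \right\} . \tag{5} \]

Equivalently,

\[ \mathbf{w}^{(k+1)} = \mathrm{Prox}_{\alpha_k r}\bigl(\mathbf{w}^{(k)} - \alpha_k \nabla f(\mathbf{w}^{(k)})\bigr) . \]

Intuitively, the proximal gradient algorithm takes a gradient descent step first and
then performs the proximal minimization step. In this algorithm,
$Q_f(\mathbf{w}|\mathbf{w}^{(k)}) = f(\mathbf{w}^{(k)}) + \langle \nabla f(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{1}{2\alpha_k} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2$.
It is also immediately verified that $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$
satisfies Assumption 11 when $\alpha_k < 1/L_f$, where $L_f$ is the Lipschitz constant of $\nabla f(\mathbf{w})$.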
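
The proximal gradient update can be sketched in a few lines. The example below is an illustration under our own assumptions: it uses the convex $\ell_1$ regularizer $r(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$, whose proximal operator is the soft-thresholding operator, and the toy loss $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{b}\|^2$ with a hypothetical vector `b`. The gradient of this loss is 1-Lipschitz, and the step size 0.5 is an arbitrary choice below 1.

```python
def soft_threshold(v, tau):
    # Proximal operator of tau * |.| applied to the scalar v.
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def proximal_gradient(grad_f, w0, step, lam, num_iters=100):
    # Repeats: gradient step on the smooth part, then the prox of step * lam * ||.||_1.
    w = list(w0)
    for _ in range(num_iters):
        g = grad_f(w)
        w = [soft_threshold(wi - step * gi, step * lam) for wi, gi in zip(w, g)]
    return w

# Toy smooth part f(w) = 0.5 * ||w - b||^2 with a hypothetical target vector b.
b = [3.0, -0.5, 1.2]

def grad_f(w):
    return [wi - bi for wi, bi in zip(w, b)]

w_hat = proximal_gradient(grad_f, [0.0, 0.0, 0.0], step=0.5, lam=1.0)
```

For this toy loss the minimizer is known in closed form as the soft-thresholding of `b` at level `lam`, which gives a simple way to check the iteration.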

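In fact, Lemma 9 implies that a quadratic surrogate of $f$ always exists provided that (2)
holds. In particular, we can define $Q_f$ as

\[ Q_f(\mathbf{w}|\mathbf{w}^{(k)}) = f(\mathbf{w}^{(k)}) + \langle \nabla f(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{L}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2, \tag{6} \]

where we require that $L > L_f$.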

Figure 1: Surrogate for logistic loss


4.2 Majorization for Nonconvex Penalty Functions 

We assume that the penalty function is $r(\mathbf{w}) = \sum_{i=1}^{p} \zeta(|w_i|)$. Thus we can construct the
surrogates for $\zeta(|w_i|)$ separately. It should be emphasized that the majorization of the penalty
function is not always necessary. For instance, when one can easily obtain

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \bigl\{ Q_f(\mathbf{w}|\mathbf{w}^{(k)}) + r(\mathbf{w}) \bigr\}, \tag{7} \]

the surrogate for $r(\mathbf{w})$ need not be considered. That is to say, this procedure is optional.
However, the surrogate for $r(\mathbf{w})$ can sometimes result in efficient computations, especially
when evaluating the proximal operator of $r(\mathbf{w})$ incurs a large computational burden.

We consider a more general case and give some assumptions. 

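Assumption 12 Let $r(\mathbf{w}) = \sum_{i=1}^{p} \zeta(|w_i|)$, where the map $\zeta : \mathbb{R}_+ \to \mathbb{R}_+$ is concave and
differentiable. Moreover, $\zeta'(t)$ is $L_\zeta$-Lipschitz continuous on $[0, +\infty)$. That is,

\[ |\zeta'(t_1) - \zeta'(t_2)| \leq L_\zeta |t_1 - t_2| \]

for any $t_1, t_2 \geq 0$.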

Many nonconvex penalties admit such properties, such as the nonconvex LOG penalty,
MCP, SCAD, etc. Although the $\ell_q$-norm ($q \in (0,1)$) may not satisfy the gradient Lipschitz
continuity condition, we alternatively consider $\zeta(|w|) = \lambda (1 + \alpha |w|)^q$, with $\alpha > 0$, which is
gradient Lipschitz continuous on $[0, +\infty)$.

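Thanks to concavity, we have

\[ \zeta(|w_i|) \leq \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|) \bigl( |w_i| - |w_i^{(k)}| \bigr) \tag{8} \]

for any $i \in \{1, \ldots, p\}$. Thus, the majorant function for $r(\mathbf{w})$, denoted by $Q_r(\mathbf{w}|\mathbf{w}^{(k)})$, is

\[ Q_r(\mathbf{w}|\mathbf{w}^{(k)}) = \sum_{i=1}^{p} \Bigl[ \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|) \bigl( |w_i| - |w_i^{(k)}| \bigr) \Bigr] . \tag{9} \]

Figure 2: Surrogate for nonconvex regularization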

It is easy to see that $Q_r(\mathbf{w}|\mathbf{w}^{(k)}) \geq r(\mathbf{w})$ and $Q_r(\mathbf{w}^{(k)}|\mathbf{w}^{(k)}) = r(\mathbf{w}^{(k)})$. Moreover, the
corresponding surrogates transform nonconvex objectives into convex ones, which brings
efficient and stable computations. As illustrated in Figure 2, at each iteration, we optimize
the tangent above the nonconvex penalty which is tight at the current estimate.
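
The majorization property of the linearized penalty can be checked numerically. The sketch below assumes the LOG penalty $\zeta(t) = \lambda \log(1 + \alpha t)$ with hypothetical values of $\lambda$ and $\alpha$, builds the tangent surrogate at a hypothetical current estimate `w_k`, and compares it with the penalty on a small grid of trial points.

```python
import math

LAM, ALPHA = 1.0, 2.0  # hypothetical penalty parameters, chosen for illustration

def zeta(t):
    # LOG penalty on t >= 0.
    return LAM * math.log1p(ALPHA * t)

def zeta_prime(t):
    # Derivative of the LOG penalty on t >= 0.
    return LAM * ALPHA / (1.0 + ALPHA * t)

def penalty(w):
    return sum(zeta(abs(wi)) for wi in w)

def surrogate(w, w_k):
    # Tangent (linearized) majorant of the penalty, built at the current estimate w_k.
    return sum(zeta(abs(ck)) + zeta_prime(abs(ck)) * (abs(wi) - abs(ck))
               for wi, ck in zip(w, w_k))

w_k = [0.5, -1.0, 0.0]  # hypothetical current estimate
steps = range(-4, 5)
trial_points = [[x / 2.0, y / 2.0, z / 2.0] for x in steps for y in steps for z in steps]

# The surrogate should lie above the penalty everywhere and touch it at w_k.
min_gap = min(surrogate(w, w_k) - penalty(w) for w in trial_points)
tight_gap = surrogate(w_k, w_k) - penalty(w_k)
```

The weights `zeta_prime(abs(ck))` are the coefficients of a weighted $\ell_1$ penalty, which is why minimizing the surrogate reduces to a convex problem.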

The key idea was studied early in DC programming (Gasso et al., 2009), which iteratively
linearizes concave functions to obtain convex surrogates. The idea has also been revisited
by Zou and Li (2008), Candes et al. (2008), Chartrand and Yin (2008).

Specifically, Zou and Li (2008) developed the local linear approximation (LLA) algorithm
and pointed out that the LLA algorithm can be cast as an EM algorithm under certain conditions.
The LLA algorithm uses the same majorant function as in (9) for the nonconvex and
nonsmooth penalty function.

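Candes et al. (2008) studied a so-called iteratively re-weighted $\ell_1$ minimization, which
also falls into the MM procedure. For example, when

\[ r(\mathbf{w}) = \lambda \sum_{i=1}^{p} \log\left(1 + \frac{1}{\epsilon} |w_i|\right), \quad \text{where } \epsilon > 0, \]

the re-weighted minimization scheme is given as

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \left\{ f(\mathbf{w}) + \lambda \sum_{i=1}^{p} \frac{|w_i|}{|w_i^{(k)}| + \epsilon} \right\}, \tag{10} \]

which can also be derived from (9).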

Algorithm 1 Majorization Minimization Algorithm for Nonconvex Penalization

1: Initialize $\mathbf{w}^{(0)}$ and $T$ (the maximum number of iterations), set $k = 0$.

2: repeat

3: Compute $Q_f(\mathbf{w}|\mathbf{w}^{(k)})$, which is the surrogate of $f(\mathbf{w})$;

4: Compute $Q_r(\mathbf{w}|\mathbf{w}^{(k)})$, which is the surrogate of $r(\mathbf{w})$;

5: Update $\mathbf{w}^{(k+1)}$ by (7) or (12);

6: $k \leftarrow k + 1$

7: until Some stopping criterion is satisfied.


To the best of our knowledge, there are few complete convergence results for these
algorithms. In particular, it is hard to address the convergence of the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$. In
most traditional treatments, the asymptotic stationary point is studied. These treatments
usually follow the general convergence results for MM algorithms (Lange, 2004) without
exploiting the property of the objective function. Our global convergence results given in
Section 5 are based on the theory of the Kurdyka-Lojasiewicz inequality. Our results directly
apply to the LLA and iteratively re-weighted $\ell_1$ minimization algorithms.


4.3 A Generic MM Algorithm 

We are now ready to summarize the whole MM procedure. Recall that the original
problem (1) includes two parts. We first consider the simple case. Here "simple" means that the
following problem can be handled easily:

\[ \mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \left\{ \frac{1}{2} \|\mathbf{w} - \mathbf{u}\|^2 + r(\mathbf{w}) \right\} . \tag{11} \]

This implies that (7) can be efficiently solved. This leads to nonconvex proximal-gradient
methods (Fukushima and Mine, 1981, Lewis and Wright, 2008). We thus generate a sequence
$\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ by (7).

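However, when (11) is intractable, we substitute $r(\mathbf{w})$ with (9). Then in each iteration,
the problem reduces to

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \bigl\{ Q_f(\mathbf{w}|\mathbf{w}^{(k)}) + Q_r(\mathbf{w}|\mathbf{w}^{(k)}) \bigr\} . \tag{12} \]

The whole procedure is briefly presented in Algorithm 1.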
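
The MM procedure with both surrogates can be sketched as follows. This is a minimal illustration under our own assumptions rather than a reference implementation: the loss is the logistic loss, the penalty is the LOG penalty, the loss surrogate is a quadratic whose curvature is `rho` times the standard bound $\frac{1}{4n}\sum_i \|\mathbf{x}_i\|^2$ on the Lipschitz constant of the logistic-loss gradient, and the penalty surrogate is the tangent majorant. With these choices each update is a weighted soft-thresholding step. The toy data and all function names are hypothetical.

```python
import math

def log1pexp(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    # Numerically stable 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, tau):
    # Minimizer over t of 0.5 * (t - v)^2 + tau * |t|.
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def objective(w, X, y, lam, alpha):
    # Logistic loss averaged over examples plus the LOG penalty.
    n = len(y)
    loss = sum(log1pexp(-yi * sum(a * b for a, b in zip(xi, w)))
               for xi, yi in zip(X, y)) / n
    return loss + lam * sum(math.log1p(alpha * abs(wi)) for wi in w)

def loss_gradient(w, X, y):
    n = len(y)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        margin = yi * sum(a * b for a, b in zip(xi, w))
        coef = -yi * sigmoid(-margin) / n
        for j, xij in enumerate(xi):
            grad[j] += coef * xij
    return grad

def mm_linearized(X, y, lam, alpha, rho=1.1, num_iters=300):
    n = len(y)
    # Curvature of the quadratic loss surrogate (assumption: rho > 1 times a Lipschitz bound).
    curvature = rho * sum(a * a for xi in X for a in xi) / (4.0 * n)
    w = [0.0] * len(X[0])
    history = [objective(w, X, y, lam, alpha)]
    for _ in range(num_iters):
        grad = loss_gradient(w, X, y)
        # Slopes of the tangent majorant of the LOG penalty at the current estimate.
        weights = [lam * alpha / (1.0 + alpha * abs(wi)) for wi in w]
        # Minimizing the sum of both surrogates separates across coordinates.
        w = [soft_threshold(wi - gi / curvature, ci / curvature)
             for wi, gi, ci in zip(w, grad, weights)]
        history.append(objective(w, X, y, lam, alpha))
    return w, history

X_TOY = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -1.0]]  # hypothetical data
Y_TOY = [1, 1, -1, -1]
w_hat, history = mm_linearized(X_TOY, Y_TOY, lam=0.01, alpha=1.0)
```

The list `history` records the objective value after each update and gives a simple check of the descent behaviour of the iteration on a given dataset.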


5. Convergence Analysis 

We now study the convergence of Algorithm 1. It should be stated that the global
convergence, which is our focus, means that for any $\mathbf{w}^{(0)} \in \mathbb{R}^p$, the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$
generated by (7) or (12) converges to a critical point of $F(\mathbf{w})$.

Lemma 13 Suppose Assumptions 10, 11 hold or Assumptions 10, 11, and 12 hold.
Then the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ generated by (7) or generated by (12) satisfies the following
properties.
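
(i) The generated sequence $\{F(\mathbf{w}^{(k)})\}_{k \in \mathbb{N}}$ is non-increasing, specifically,

\[ F(\mathbf{w}^{(k)}) - F(\mathbf{w}^{(k+1)}) \geq \frac{\gamma}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2, \quad \forall k \geq 0 . \]

(ii)

\[ \sum_{k=0}^{\infty} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 < +\infty, \]

which implies $\lim_{k \to \infty} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\| = 0$.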

Lemma 13 expresses the descent property of the MM approach, which always makes the
objective value decrease after each iteration. Moreover, the objective function value decreases
by at least the amount given in Lemma 13 (i) at the $k$th step. By the fact that $\inf_{\mathbf{w}} F(\mathbf{w}) > -\infty$, we can
draw the conclusion that the sequence $\{F(\mathbf{w}^{(k)})\}_{k \in \mathbb{N}}$ converges.

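Because of the coerciveness of the function $F(\mathbf{w})$, there exists a convergent subsequence
$\{\mathbf{w}^{(k_j)}\}_{j \in \mathbb{N}}$ that converges to $\bar{\mathbf{w}}$. The set of all cluster (limit) points of the sequence started
from $\mathbf{w}^{(0)}$ is denoted by $\mathcal{M}(\mathbf{w}^{(0)})$. That is,

\[ \mathcal{M}(\mathbf{w}^{(0)}) = \left\{ \bar{\mathbf{w}} \in \mathbb{R}^p : \exists \text{ an increasing sequence of integers } \{k_j\}_{j \in \mathbb{N}} \text{ such that } \mathbf{w}^{(k_j)} \to \bar{\mathbf{w}} \text{ as } j \to \infty \right\} . \]

It is also easy to see that $F(\mathbf{w})$ is constant and finite on $\mathcal{M}(\mathbf{w}^{(0)})$. In the following
lemma we attempt to demonstrate that all points which belong to $\mathcal{M}(\mathbf{w}^{(0)})$ are stationary
or critical points of $F(\mathbf{w})$.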

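Lemma 14 Suppose Assumptions 7, 10, 11 hold, and the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ is generated
by (7). Let $A^{(k+1)} = \nabla f(\mathbf{w}^{(k+1)}) - \nabla Q_f(\mathbf{w}^{(k+1)}|\mathbf{w}^{(k)})$. Then

(i) $A^{(k+1)} \in \partial F(\mathbf{w}^{(k+1)})$;

(ii) $\|A^{(k+1)}\| \leq (L_Q + L_f) \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|$, where $L_Q$ is the Lipschitz constant of $\nabla Q_f$.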

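Lemma 15 Suppose $r(\mathbf{w}) = \sum_{i=1}^{p} \zeta(|w_i|)$, where $\zeta : \mathbb{R}_+ \to \mathbb{R}_+$ is concave and continuously
differentiable on $[0, +\infty)$. Let $Q_r(\mathbf{w}|\mathbf{w}^{(k)}) = \sum_{i=1}^{p} \bigl[ \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|)(|w_i| - |w_i^{(k)}|) \bigr]$. Then

(i) $Q_r(\mathbf{w}|\mathbf{w}^{(k)}) \geq r(\mathbf{w})$ and $Q_r(\mathbf{w}^{(k)}|\mathbf{w}^{(k)}) = r(\mathbf{w}^{(k)})$;

(ii) $\partial Q_r(\mathbf{w}^{(k)}|\mathbf{w}^{(k)}) = \partial r(\mathbf{w}^{(k)})$.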

Lemma 15 shows the relationship between the nonconvex (nonsmooth) penalty function 
and the corresponding surrogate. This also implies that the surrogate approximates the 
penalty function well. 

We introduce the notion of $\mathrm{sgn}(t)$, which is defined as

\[ \mathrm{sgn}(t) = \begin{cases} 1 & \text{if } t > 0, \\ c & \text{if } t = 0, \\ -1 & \text{if } t < 0 . \end{cases} \tag{13} \]

Here $c$ is some real number in $[-1, 1]$. We emphasize that $\mathrm{sgn}(t)$ is a scalar rather than a
set.
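
Lemma 16 (Main Lemma) Suppose Assumptions 7, 10, 11, 12 hold, and the sequence
$\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ is generated by (12). Let $b_i^{(k+1)} = \bigl( \zeta'(|w_i^{(k+1)}|) - \zeta'(|w_i^{(k)}|) \bigr) \mathrm{sgn}(w_i^{(k+1)})$ for
$i \in [p]$, and $B^{(k+1)} = \mathbf{b}^{(k+1)} - \nabla Q_f(\mathbf{w}^{(k+1)}|\mathbf{w}^{(k)}) + \nabla f(\mathbf{w}^{(k+1)})$. Then

(i) $B^{(k+1)} \in \partial F(\mathbf{w}^{(k+1)})$;

(ii) $\|B^{(k+1)}\| \leq (L_Q + L_f + L_\zeta) \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|$.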

Both Lemma 14 and Lemma 16 suggest a subgradient lower bound for the iterate gap. 
Due to the majorization of the nonconvex and nonsmooth penalty functions, it is more 
challenging to bound the subgradient. The ingredient is to observe that the majorant 
function and the original one share the same subgradient at the current estimate. With 
Lemmas 14 and 16, we are now ready to state the following lemma. 

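Lemma 17 Suppose Assumptions 7, 10, 11, 12 hold. Let the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ be
generated by (7) or (12). Then

(i) $\mathcal{M}(\mathbf{w}^{(0)})$ is not empty and $\mathcal{M}(\mathbf{w}^{(0)}) \subset \mathrm{crit} F$;

(ii)

\[ \lim_{k \to \infty} \mathrm{dist}\bigl( \mathbf{w}^{(k)}, \mathcal{M}(\mathbf{w}^{(0)}) \bigr) = 0 . \]

Lemma 17 implies that $\mathcal{M}(\mathbf{w}^{(0)})$ is a subset of the stationary or critical points of $F(\mathbf{w})$
and that the iterates $\mathbf{w}^{(k)}$ approach a point of $\mathcal{M}(\mathbf{w}^{(0)})$. Our current concern is to prove
$\lim_{k \to \infty} \mathbf{w}^{(k)} = \mathbf{w}^*$. From Lange (2004), we know that $\mathcal{M}(\mathbf{w}^{(0)})$ is connected. Additionally,
if $\mathcal{M}(\mathbf{w}^{(0)})$ is finite, $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ converges.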

We can obtain the global convergence based on the assumption that $\mathcal{M}(\mathbf{w}^{(0)})$ is finite.
However, the assumption is not practical, because $\mathcal{M}(\mathbf{w}^{(0)})$ is usually unknown. Moreover, it is
hard to check this assumption. To avoid this issue, the Kurdyka-Lojasiewicz property of
the objective function comes into play, because it is often very easy to verify the
Kurdyka-Lojasiewicz property of a function.



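Theorem 18 Suppose that $F$ has the Kurdyka-Lojasiewicz property at each point of $\mathrm{dom}\, \partial F$,
and Assumptions 7, 10, 11, 12 hold. Let the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ be generated by scheme
(7) or (12). Then the following assertions hold.

(i) The sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ has finite length:

\[ \sum_{k=0}^{\infty} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\| < \infty . \tag{15} \]

(ii) The sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ converges to a critical point $\mathbf{w}^*$ of $F$.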

Theorem 18 shows the global convergence of Algorithm 1. As we have stated, many
methods for solving a nonconvex and nonsmooth problem, such as the re-weighted $\ell_1$
(Candes et al., 2008) and LLA (Zou and Li, 2008), share the same convergence property as in
Theorem 18. Attouch et al. (2010), Bolte et al. (2013) have well established the global
convergence for nonconvex and nonsmooth problems based on the Kurdyka-Lojasiewicz
inequality. It is also interesting to point out that their procedures fall into (7). However, they
focused on a coordinate descent procedure. The work of Attouch et al. (2010), Bolte et al.
(2013) cannot be trivially extended to our more general case.

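Theorem 19 (Convergence Rate) Suppose Assumptions 7, 10, 11, 12 hold, and
$\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ is generated by (7) or (12) and converges to a critical point $\mathbf{w}^*$ of $F$, where $F$
satisfies the Kurdyka-Lojasiewicz property at each point of $\mathrm{dom}\, \partial F$ with $\varphi(t) = c t^{1 - \theta}$ for
$c > 0$ and $\theta \in [0, 1)$. We have

(i) if $\theta = 0$, $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ converges to $\mathbf{w}^*$ in finite iterations;

(ii) if $\theta \in (0, \frac{1}{2}]$, $\|\mathbf{w}^{(k)} - \mathbf{w}^*\| \leq C \rho^k$, $\forall k \geq K_0$, for some $K_0 > 0$, $C > 0$, $\rho \in (0, 1)$;

(iii) if $\theta \in (\frac{1}{2}, 1)$, $\|\mathbf{w}^{(k)} - \mathbf{w}^*\| \leq C k^{-\frac{1 - \theta}{2\theta - 1}}$, $\forall k \geq K_0$, for some $K_0 > 0$, $C > 0$.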

Theorem 19 tells us the convergence rate of our MM procedure for solving the nonconvex 
regularized problem, which is based on the geometrical property of the function F around 
its critical point. We see that the convergence rate is at least sublinear. 

6. Extension to Concave-Convex Procedure

In this section we show that our work can be extended to the concave-convex procedure 
(CCCP) (Yuille and Rangarajan, 2003). It is worth noting that CCCP can be also unified 
into the MM framework. 

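The CCCP is usually used to solve the following problem:

\[ \begin{aligned} \min_{\mathbf{w}} \quad & u(\mathbf{w}) - v(\mathbf{w}), \\ \text{s.t.} \quad & c_i(\mathbf{w}) \leq 0, \quad i \in [n], \\ & d_j(\mathbf{w}) = 0, \quad j \in [m], \end{aligned} \tag{16} \]

where $u$, $v$ and $c_i$ are real-valued convex functions and $d_j$ are affine functions. The CCCP
algorithm aims to solve the following sequence of convex optimization problems:

\[ \begin{aligned} \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \quad & u(\mathbf{w}) - \nabla v(\mathbf{w}^{(k)})^T \mathbf{w}, \\ \text{s.t.} \quad & c_i(\mathbf{w}) \leq 0, \quad i \in [n], \\ & d_j(\mathbf{w}) = 0, \quad j \in [m]. \end{aligned} \tag{17} \]

Denote $C = \{\mathbf{w} : c_i(\mathbf{w}) \leq 0,\ d_j(\mathbf{w}) = 0,\ i \in [n],\ j \in [m]\}$, and let $\delta_C(\mathbf{w})$ be the indicator
function of the feasible set $C$; that is,

\[ \delta_C(\mathbf{w}) = \begin{cases} 0, & \mathbf{w} \in C, \\ +\infty, & \mathbf{w} \notin C . \end{cases} \]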

It can be proved directly that $\delta_C$ is a convex function. Now the original problem can be
reformulated as

\[ \min_{\mathbf{w}} F(\mathbf{w}) = \bigl\{ \delta_C(\mathbf{w}) + u(\mathbf{w}) - v(\mathbf{w}) \bigr\} . \tag{18} \]

Thus, the CCCP approach solves the following convex problem at each iteration:

\[ \mathbf{w}^{(k+1)} = \operatorname*{argmin}_{\mathbf{w}} \bigl\{ \delta_C(\mathbf{w}) + u(\mathbf{w}) - \nabla v(\mathbf{w}^{(k)})^T \mathbf{w} \bigr\} . \tag{19} \]

In fact, the CCCP approach can be viewed as an MM algorithm. In particular, since
$v(\mathbf{w})$ is convex, $-v(\mathbf{w})$ is concave. As a result, we have

\[ -v(\mathbf{w}) \leq -v(\mathbf{w}^{(k)}) - \nabla v(\mathbf{w}^{(k)})^T (\mathbf{w} - \mathbf{w}^{(k)}) . \]

This leads us to the linear majorization of $-v(\mathbf{w})$. When the constant part is omitted,
(17) or (19) is recovered. In summary, CCCP linearizes the concave part of the objective
function. Next, we make some assumptions to address the convergence of CCCP.
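
The linearization step of CCCP can be illustrated on a one-dimensional problem without constraints. The sketch below is an example of our own choosing: it assumes $u(w) = \frac{1}{2} w^2$ and $v(w) = 2\sqrt{1 + w^2}$, both convex and differentiable, so that the convex subproblem at each iteration has the closed-form solution $w^{(k+1)} = v'(w^{(k)})$. The critical points of $u - v$ are $0$ and $\pm\sqrt{3}$.

```python
import math

def u(w):
    # Convex part: strongly convex quadratic (assumption of this sketch).
    return 0.5 * w * w

def v(w):
    # Convex function whose negative is the concave part (assumption of this sketch).
    return 2.0 * math.sqrt(1.0 + w * w)

def v_grad(w):
    return 2.0 * w / math.sqrt(1.0 + w * w)

def objective(w):
    return u(w) - v(w)

def cccp(w0, num_iters=60):
    w = w0
    values = [objective(w)]
    for _ in range(num_iters):
        # Linearize -v at the current iterate and minimize 0.5 * w^2 - v'(w_k) * w.
        # Setting the derivative to zero gives w = v'(w_k).
        w = v_grad(w)
        values.append(objective(w))
    return w, values

w_hat, values = cccp(w0=0.5)
```

Started from $w^{(0)} = 0.5$, the iterates increase towards $\sqrt{3}$ while the objective values do not increase, in line with the MM interpretation of CCCP.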

Assumption 20 Consider the problem in (18) where $\delta_C$, $u$, and $v$ are convex functions.
Suppose the following three assertions hold.

(i) $u(\mathbf{w})$ and $v(\mathbf{w})$ are $C^1$ functions;

(ii) $u(\mathbf{w})$ is $\gamma$-strongly convex;

(iii) $\nabla u(\mathbf{w})$ is Lipschitz continuous.

With the above assumption, the following theorem shows that the sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$
generated by CCCP converges to a critical point of $F(\mathbf{w})$.

Theorem 21 (Global Convergence of CCCP) Suppose Assumptions 10 and 20 hold, and
$F$ satisfies the Kurdyka-Lojasiewicz property at each point of $\mathrm{dom}\, \partial F$. Let the sequence
$\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ be generated by (19). Then the conclusions of Theorems 18 and 19 hold.

It is worth pointing out that the global convergence analysis for CCCP has also been
studied by Lanckriet and Sriperumbudur (2009). Their analysis is based on
Zangwill's theory. Zangwill's theory is a very important tool to deal with the convergence
issue of iterative algorithms. But it typically requires that $\mathcal{M}(\mathbf{w}^{(0)})$ is finite or discrete
to achieve the convergent sequence $\{\mathbf{w}^{(k)}\}_{k \in \mathbb{N}}$ (Wu, 1983, Lanckriet and Sriperumbudur,
2009). In contrast, our analysis based on the Kurdyka-Lojasiewicz inequality does not need
this requirement.

7. Numerical Analysis 

In this paper our principal focus has been to explore the convergence properties of
majorization minimization (MM) algorithms for nonconvex optimization problems. However,
we have also developed two special MM algorithms based on (7) and (12), respectively.
Thus, it is interesting to conduct an empirical analysis of the convergence of the algorithms. We
particularly employ the logistic loss and LOG penalty for the classification problem. We
refer to the algorithms as MM-(a) and MM-(b) for simplicity of discussion.
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We evaluate both MM-(a) and MM-(b) on binary datasets^. Descriptions of the datasets 
are reported in Table 2. For each dataset {x*, 

^ n p 

^(w) = - ^ log(l + exp(-yixfw)) + A ^ log(l + Q;|u;i|), (20) 

^ i=l i=l 

'-V-' '-V-" 

/(w) r(w) 

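where $\lambda > 0$ and $\alpha > 0$ are hyperparameters. We adopt the corresponding majorization
functions $Q_f(w|w^{(k)})$ and

$$Q_r(w|w^{(k)}) = \lambda \sum_{i=1}^{p} \Big\{ \log(1 + \alpha |w_i^{(k)}|) + \frac{\alpha}{1 + \alpha |w_i^{(k)}|} \big(|w_i| - |w_i^{(k)}|\big) \Big\}.$$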

As mentioned in the previous section, the Lipschitz constant of V/(w) is bounded by 
S X]r=i Typically, to set the value one often uses the line-search method (Beck 

and Teboulle, 2009) to achieve better performance. However, since we are only concerned 
with the convergence behavior of MM, we just set where p > 1. 

We plot the error between the objective function values and $F(w^*)$ (log scaled) vs. CPU
time for different hyperparameter settings in Figure 3. We observe that both
MM-(a) and MM-(b) generate a monotonically decreasing sequence $\{F(w^{(k)})\}$ and achieve
nearly the same optimal objective value. We also find that MM-(b) runs faster than MM-(a).
This implies that it is efficient to construct the majorization function of the LOG penalty.
In fact, MM-(a) costs more computation because it directly calculates the proximal
operator of the LOG penalty. In contrast, MM-(b) only needs to apply the soft-thresholding
(shrinkage) operator to the current estimate. In summary, the numerical experiments show
that both MM-(a) and MM-(b) make the objective function value decrease and converge.
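The soft-thresholding step described for MM-(b) can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a standard quadratic majorizer $Q_f(w|w^{(k)}) = f(w^{(k)}) + \nabla f(w^{(k)})^T (w - w^{(k)}) + \frac{L}{2}\|w - w^{(k)}\|^2$, and it assumes the constant $L = \rho \cdot \frac{1}{4n}\sum_i \|x_i\|^2$ with $\rho > 1$, which upper-bounds the Lipschitz constant of the logistic-loss gradient. The synthetic data and the names `mm_b`, `soft_threshold`, and `rho` are hypothetical.

```python
import numpy as np

def objective(w, X, y, lam, alpha):
    # Logistic loss plus LOG penalty, as in (20).
    loss = np.mean(np.logaddexp(0.0, -y * (X @ w)))
    return loss + lam * np.sum(np.log1p(alpha * np.abs(w)))

def grad_f(w, X, y):
    m = y * (X @ w)
    s = 0.5 * (1.0 - np.tanh(0.5 * m))  # numerically stable 1 / (1 + exp(m))
    return -(X.T @ (y * s)) / X.shape[0]

def soft_threshold(z, t):
    # Shrinkage operator; t may be a vector of per-coordinate thresholds.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mm_b(X, y, lam=0.01, alpha=1.0, rho=1.1, iters=200):
    n, p = X.shape
    # Assumed curvature constant: rho times an upper bound on the Lipschitz
    # constant of grad f for the averaged logistic loss.
    L = rho * np.sum(X ** 2) / (4.0 * n)
    w = np.zeros(p)
    history = [objective(w, X, y, lam, alpha)]
    for _ in range(iters):
        z = w - grad_f(w, X, y) / L                        # gradient step from Q_f
        weights = lam * alpha / (1.0 + alpha * np.abs(w))  # slopes of Q_r at w^(k)
        w = soft_threshold(z, weights / L)                 # minimizer of Q_f + Q_r
        history.append(objective(w, X, y, lam, alpha))
    return w, np.array(history)

# Synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = np.where(X @ w_true + 0.1 * rng.standard_normal(100) > 0, 1.0, -1.0)
w_hat, history = mm_b(X, y)
```

Because both surrogates majorize their targets and touch them at the current iterate, the recorded objective values in `history` should not increase, which mirrors the monotone decrease reported in the experiments.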


Table 2: Description of the datasets

Data sets    n         p          storage
leukemia     72        7129       sparse
news20       19996     1355191    sparse
covtype      581012    54         dense

Figure 3: Performance of MM with different parameter settings

2. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

8. Conclusions

Majorization minimization (MM) algorithms are very popular in machine learning and
statistical inference. In this paper, we have employed MM algorithms to solve nonconvex
regularized problems. However, the convergence analysis of MM for nonconvex and
nonsmooth problems is a challenging issue. We have established the global convergence results of
the MM procedure using the geometrical property of the objective function. In particular,
our results are built on the Kurdyka-Lojasiewicz inequality. We have shown that our results
also apply to the iteratively re-weighted $\ell_1$ minimization method, local linear approximation
(LLA), and concave-convex procedure (CCCP).

Appendix A. Proofs 

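A.1 The proof of Corollary 8

Proof Since $f_i(w)$ is differentiable for $i \in [n]$ and each $\nabla f_i(w)$ is $L_i$-Lipschitz continuous,
we have

$$\|\nabla f_i(u) - \nabla f_i(v)\| \le L_i \|u - v\|,$$

for $i \in [n]$. Then

$$\|\nabla h(u) - \nabla h(v)\| = \Big\| \sum_{i=1}^{n} a_i \nabla f_i(u) - \sum_{i=1}^{n} a_i \nabla f_i(v) \Big\|
\le \sum_{i=1}^{n} |a_i| \, \|\nabla f_i(u) - \nabla f_i(v)\|
\le \Big( \sum_{i=1}^{n} |a_i| L_i \Big) \|u - v\|.$$

So, $\nabla h(w)$ is $\sum_{i=1}^{n} |a_i| L_i$-Lipschitz continuous. ■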
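As a quick numerical sanity check of this bound (a sketch under assumed test functions, not part of the proof), one can take quadratics $f_i(w) = \frac{1}{2} w^T A_i w$ with symmetric $A_i$, for which $\nabla f_i(w) = A_i w$ and $L_i = \|A_i\|_2$, and compare the observed ratio $\|\nabla h(u) - \nabla h(v)\| / \|u - v\|$ with $\sum_i |a_i| L_i$. The matrices, coefficients, and sample points below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 4

# Assumed test functions: f_i(w) = 0.5 * w^T A_i w with symmetric A_i,
# so grad f_i(w) = A_i w and L_i is the spectral norm of A_i.
mats = []
for _ in range(n):
    B = rng.standard_normal((p, p))
    mats.append((B + B.T) / 2.0)

a = rng.standard_normal(n)                       # coefficients a_i (mixed signs)
L = [np.linalg.norm(A, 2) for A in mats]         # Lipschitz constants L_i
bound = sum(abs(ai) * Li for ai, Li in zip(a, L))

def grad_h(w):
    # Gradient of h(w) = sum_i a_i f_i(w).
    return sum(ai * (A @ w) for ai, A in zip(a, mats))

ratios = []
for _ in range(1000):
    u, v = rng.standard_normal(p), rng.standard_normal(p)
    ratios.append(np.linalg.norm(grad_h(u) - grad_h(v)) / np.linalg.norm(u - v))
worst_ratio = max(ratios)
```

The bound is usually loose when the coefficients have mixed signs, since the triangle inequality ignores cancellation between the terms.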


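A.2 The proof of Lemma 13
Proof We first consider the procedure (7).
(i) Recall that

$$w^{(k+1)} = \operatorname*{argmin}_{w} \{ Q_f(w|w^{(k)}) + r(w) \}.$$

We have

$$Q_f(w^{(k+1)}|w^{(k)}) + r(w^{(k+1)}) \le Q_f(w^{(k)}|w^{(k)}) + r(w^{(k)}). \tag{21}$$

By the strong convexity of $Q_f(w|w^{(k)}) - f(w)$,

$$Q_f(w^{(k+1)}|w^{(k)}) - f(w^{(k+1)}) - Q_f(w^{(k)}|w^{(k)}) + f(w^{(k)})
\ge \langle \nabla Q_f(w^{(k)}|w^{(k)}) - \nabla f(w^{(k)}), w^{(k+1)} - w^{(k)} \rangle + \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2
= \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2.$$

The last equality follows from $\nabla Q_f(w^{(k)}|w^{(k)}) = \nabla f(w^{(k)})$.

Combining with (21), we have

$$f(w^{(k)}) + r(w^{(k)}) - f(w^{(k+1)}) - r(w^{(k+1)}) \ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2.$$

So

$$F(w^{(k)}) - F(w^{(k+1)}) \ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2.$$

(ii) We sum the above inequality from $k = 0$ to $+\infty$. Then

$$\frac{\gamma}{2} \sum_{k=0}^{+\infty} \|w^{(k+1)} - w^{(k)}\|^2 \le \sum_{k=0}^{+\infty} \big( F(w^{(k)}) - F(w^{(k+1)}) \big).$$

Notice that $\inf_w F(w) > -\infty$, so

$$\sum_{k=0}^{+\infty} \|w^{(k+1)} - w^{(k)}\|^2 \le \frac{2}{\gamma} \big( F(w^{(0)}) - F(w^{(\infty)}) \big) < +\infty,$$

which completes the proof.

We now turn to the sequence generated by (12).

Similarly,

$$w^{(k+1)} = \operatorname*{argmin}_{w} \{ Q_f(w|w^{(k)}) + Q_r(w|w^{(k)}) \}.$$

We obtain

$$Q_f(w^{(k+1)}|w^{(k)}) + Q_r(w^{(k+1)}|w^{(k)}) - Q_f(w^{(k)}|w^{(k)}) - Q_r(w^{(k)}|w^{(k)}) \le 0. \tag{22}$$

As in the argument above, we have

$$Q_f(w^{(k+1)}|w^{(k)}) - f(w^{(k+1)}) - Q_f(w^{(k)}|w^{(k)}) + f(w^{(k)}) \ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2.$$

On the other hand, by the concavity of $\zeta$,

$$Q_r(w^{(k+1)}|w^{(k)}) - Q_r(w^{(k)}|w^{(k)})
= \sum_{i=1}^{p} \Big[ \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|)\big(|w_i^{(k+1)}| - |w_i^{(k)}|\big) \Big]
- \sum_{i=1}^{p} \Big[ \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|)\big(|w_i^{(k)}| - |w_i^{(k)}|\big) \Big]$$

$$= \sum_{i=1}^{p} \zeta'(|w_i^{(k)}|)\big(|w_i^{(k+1)}| - |w_i^{(k)}|\big)
\ge \sum_{i=1}^{p} \zeta(|w_i^{(k+1)}|) - \sum_{i=1}^{p} \zeta(|w_i^{(k)}|)
= r(w^{(k+1)}) - r(w^{(k)}).$$

Combining the above three inequalities, we have

$$f(w^{(k)}) + r(w^{(k)}) - f(w^{(k+1)}) - r(w^{(k+1)}) \ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2,$$

which implies that

$$F(w^{(k)}) - F(w^{(k+1)}) \ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2.$$

A.3 The proof of Lemma 14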
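Proof

(i) Recall that

$$w^{(k+1)} = \operatorname*{argmin}_{w} \{ Q_f(w|w^{(k)}) + r(w) \}.$$

Writing down the optimality condition, we have

$$0 = \nabla Q_f(w^{(k+1)}|w^{(k)}) + \xi^{(k+1)}, \tag{23}$$

where $\xi^{(k+1)} \in \partial r(w^{(k+1)})$. Let us rewrite it as follows:

$$\nabla Q_f(w^{(k+1)}|w^{(k)}) - \nabla f(w^{(k+1)}) + \nabla f(w^{(k+1)}) + \xi^{(k+1)} = 0.$$

Because $A^{(k+1)} = \nabla f(w^{(k+1)}) - \nabla Q_f(w^{(k+1)}|w^{(k)})$, we immediately have

$$A^{(k+1)} = \nabla f(w^{(k+1)}) + \xi^{(k+1)} \in \partial F(w^{(k+1)}).$$

(ii) With the Lipschitz continuity of $\nabla Q_f(w|w^{(k)})$ and $\nabla f(w)$, we have

$$\begin{cases}
\|\nabla Q_f(w^{(k+1)}|w^{(k)}) - \nabla Q_f(w^{(k)}|w^{(k)})\| \le L_{Q_f} \|w^{(k+1)} - w^{(k)}\|, \\
\|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)})\| \le L_f \|w^{(k+1)} - w^{(k)}\|.
\end{cases} \tag{24}$$

Hence, using $\nabla Q_f(w^{(k)}|w^{(k)}) = \nabla f(w^{(k)})$,

$$\|A^{(k+1)}\| = \|\nabla f(w^{(k+1)}) - \nabla Q_f(w^{(k+1)}|w^{(k)})\|$$

$$= \|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)}) + \nabla Q_f(w^{(k)}|w^{(k)}) - \nabla Q_f(w^{(k+1)}|w^{(k)})\|$$

$$\le \|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)})\| + \|\nabla Q_f(w^{(k)}|w^{(k)}) - \nabla Q_f(w^{(k+1)}|w^{(k)})\|$$

$$\le L_f \|w^{(k+1)} - w^{(k)}\| + L_{Q_f} \|w^{(k+1)} - w^{(k)}\|
= (L_{Q_f} + L_f) \|w^{(k+1)} - w^{(k)}\|. \tag{25}$$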


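A.4 The proof of Lemma 15
Proof

(i) By the concavity of $\zeta(\cdot)$, we have

$$\zeta(|w_i|) \le \zeta(|w_i^{(k)}|) + \zeta'(|w_i^{(k)}|)\big(|w_i| - |w_i^{(k)}|\big),$$

for any $i \in [p]$. We immediately obtain

$$Q_r(w|w^{(k)}) \ge r(w) \quad \text{and} \quad Q_r(w^{(k)}|w^{(k)}) = r(w^{(k)}).$$

(ii) Notice that the subdifferential calculus for separable functions yields the
following (Rockafellar et al., 1998):

$$\partial \Big( \sum_{i=1}^{p} \zeta(w_i) \Big) = \partial \zeta(w_1) \times \partial \zeta(w_2) \times \cdots \times \partial \zeta(w_p).$$

Since $r(w)$ and $Q_r(w|w^{(k)})$ are separable, we consider each dimension independently. For
any $i \in [p]$, if $w_i^{(k)} > 0$, we have

$$\partial_i r(w^{(k)}) = \{\zeta'(w_i^{(k)})\}, \quad \partial_i Q_r(w^{(k)}|w^{(k)}) = \{\zeta'(w_i^{(k)})\}.$$

Similarly, if $w_i^{(k)} < 0$, we have

$$\partial_i r(w^{(k)}) = \{-\zeta'(-w_i^{(k)})\}, \quad \partial_i Q_r(w^{(k)}|w^{(k)}) = \{-\zeta'(-w_i^{(k)})\}.$$

For the case $w_i^{(k)} = 0$, we have

$$\partial_i r(w^{(k)}) = [-\zeta'(0), \zeta'(0)], \quad \partial_i Q_r(w^{(k)}|w^{(k)}) = [-\zeta'(0), \zeta'(0)].$$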
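Part (i) can be checked numerically for the LOG penalty used in the experiments. The sketch below assumes $\zeta(t) = \log(1 + \alpha t)$, so $\zeta'(t) = \alpha / (1 + \alpha t)$, and omits the factor $\lambda$ as in the statement above. The value of `alpha` and the random test points are illustrative assumptions.

```python
import numpy as np

alpha = 2.0  # assumed penalty parameter for this check

def zeta(t):
    # Concave, increasing penalty on t >= 0 (LOG penalty).
    return np.log1p(alpha * t)

def zeta_prime(t):
    return alpha / (1.0 + alpha * t)

def r(w):
    return np.sum(zeta(np.abs(w)))

def Q_r(w, wk):
    # Linear majorization of zeta at |wk_i|, summed over coordinates.
    a = np.abs(wk)
    return np.sum(zeta(a) + zeta_prime(a) * (np.abs(w) - a))

rng = np.random.default_rng(2)
gaps = []
for _ in range(1000):
    w, wk = rng.standard_normal(6), rng.standard_normal(6)
    gaps.append(Q_r(w, wk) - r(w))  # should be nonnegative by concavity
min_gap = min(gaps)

wk = rng.standard_normal(6)
tangent_gap = abs(Q_r(wk, wk) - r(wk))  # surrogate touches r at w^(k)
```

Both properties are needed for the descent argument of Lemma 13: the inequality gives the upper bound, and the tangency condition makes the bound tight at the current iterate.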


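A.5 The proof of Lemma 16
Proof

(i) Recall that

$$w^{(k+1)} = \operatorname*{argmin}_{w} \{ Q_f(w|w^{(k)}) + Q_r(w|w^{(k)}) \}.$$

Alternatively,

$$w^{(k+1)} = \operatorname*{argmin}_{w} \Big\{ Q_f(w|w^{(k)}) + \sum_{i=1}^{p} \zeta'(|w_i^{(k)}|) |w_i| \Big\}.$$

The optimality condition yields

$$0 = \nabla Q_f(w^{(k+1)}|w^{(k)}) + \big( \zeta'(|w_1^{(k)}|)\,\mathrm{sgn}(w_1^{(k+1)}), \ldots, \zeta'(|w_p^{(k)}|)\,\mathrm{sgn}(w_p^{(k+1)}) \big)^T. \tag{26}$$

We rewrite it as

$$0 = \nabla Q_f(w^{(k+1)}|w^{(k)}) - \nabla f(w^{(k+1)}) + \nabla f(w^{(k+1)})
+ \big( \zeta'(|w_1^{(k)}|)\,\mathrm{sgn}(w_1^{(k+1)}), \zeta'(|w_2^{(k)}|)\,\mathrm{sgn}(w_2^{(k+1)}), \ldots, \zeta'(|w_p^{(k)}|)\,\mathrm{sgn}(w_p^{(k+1)}) \big)^T.$$

On the other hand, let

$$b_i^{(k)} = \mathrm{sgn}(w_i^{(k+1)}) \big( \zeta'(|w_i^{(k)}|) - \zeta'(|w_i^{(k+1)}|) \big),$$

for $i \in [p]$. So

$$0 = \nabla f(w^{(k+1)}) + \big( \zeta'(|w_1^{(k+1)}|)\,\mathrm{sgn}(w_1^{(k+1)}), \ldots, \zeta'(|w_p^{(k+1)}|)\,\mathrm{sgn}(w_p^{(k+1)}) \big)^T
+ \nabla Q_f(w^{(k+1)}|w^{(k)}) - \nabla f(w^{(k+1)}) + (b_1^{(k)}, b_2^{(k)}, \ldots, b_p^{(k)})^T.$$

Then, we have

$$\nabla f(w^{(k+1)}) + \big( \zeta'(|w_1^{(k+1)}|)\,\mathrm{sgn}(w_1^{(k+1)}), \ldots, \zeta'(|w_p^{(k+1)}|)\,\mathrm{sgn}(w_p^{(k+1)}) \big)^T
= \nabla f(w^{(k+1)}) - \nabla Q_f(w^{(k+1)}|w^{(k)}) - (b_1^{(k)}, b_2^{(k)}, \ldots, b_p^{(k)})^T
= B^{(k+1)}.$$

Notice that

$$\nabla f(w^{(k+1)}) + \big( \zeta'(|w_1^{(k+1)}|)\,\mathrm{sgn}(w_1^{(k+1)}), \ldots, \zeta'(|w_p^{(k+1)}|)\,\mathrm{sgn}(w_p^{(k+1)}) \big)^T \in \partial F(w^{(k+1)}).$$

So we have

$$B^{(k+1)} \in \partial F(w^{(k+1)}).$$

(ii) Similarly, with the Lipschitz continuity of $\nabla Q_f(w|w^{(k)})$, $\nabla f(w)$ and $\zeta'(t)$ ($t > 0$) we
have

$$\begin{cases}
\|\nabla Q_f(w^{(k+1)}|w^{(k)}) - \nabla Q_f(w^{(k)}|w^{(k)})\| \le L_{Q_f} \|w^{(k+1)} - w^{(k)}\|, \\
\|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)})\| \le L_f \|w^{(k+1)} - w^{(k)}\|, \\
|\zeta'(t_1) - \zeta'(t_2)| \le L_\zeta |t_1 - t_2|.
\end{cases} \tag{27}$$

Now, we are ready to bound the subgradient:

$$\|B^{(k+1)}\| = \|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)}) + \nabla Q_f(w^{(k)}|w^{(k)}) - \nabla Q_f(w^{(k+1)}|w^{(k)}) - (b_1^{(k)}, \ldots, b_p^{(k)})^T\|$$

$$\le \|\nabla f(w^{(k+1)}) - \nabla f(w^{(k)})\| + \|\nabla Q_f(w^{(k)}|w^{(k)}) - \nabla Q_f(w^{(k+1)}|w^{(k)})\| + \|(b_1^{(k)}, \ldots, b_p^{(k)})^T\|$$

$$\le L_f \|w^{(k+1)} - w^{(k)}\| + L_{Q_f} \|w^{(k+1)} - w^{(k)}\| + \|(b_1^{(k)}, \ldots, b_p^{(k)})^T\|. \tag{28}$$

It remains to bound $\|(b_1^{(k)}, b_2^{(k)}, \ldots, b_p^{(k)})^T\|$. For each $i \in [p]$, we have

$$b_i^{(k)} = \mathrm{sgn}(w_i^{(k+1)}) \big( \zeta'(|w_i^{(k)}|) - \zeta'(|w_i^{(k+1)}|) \big).$$

By using $|\mathrm{sgn}(w_i^{(k+1)})| \le 1$, we have

$$|b_i^{(k)}| = \big| \mathrm{sgn}(w_i^{(k+1)}) \big( \zeta'(|w_i^{(k)}|) - \zeta'(|w_i^{(k+1)}|) \big) \big|
\le \big| \zeta'(|w_i^{(k)}|) - \zeta'(|w_i^{(k+1)}|) \big|
\le L_\zeta \big| |w_i^{(k+1)}| - |w_i^{(k)}| \big|.$$

Then

$$\|(b_1^{(k)}, \ldots, b_p^{(k)})^T\|
\le L_\zeta \big\| \big( |w_1^{(k+1)}| - |w_1^{(k)}|, \ldots, |w_p^{(k+1)}| - |w_p^{(k)}| \big)^T \big\|
\le L_\zeta \big\| \big( w_1^{(k+1)} - w_1^{(k)}, \ldots, w_p^{(k+1)} - w_p^{(k)} \big)^T \big\|
= L_\zeta \|w^{(k+1)} - w^{(k)}\|. \tag{29}$$

Combining (28) and (29),

$$\|B^{(k+1)}\| \le (L_{Q_f} + L_f + L_\zeta) \|w^{(k+1)} - w^{(k)}\|.$$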


A.6 The proof of Lemma 17

Proof Since $F(w)$ is coercive, the sequence $\{w^{(k)}\}_{k \in \mathbb{N}}$ is bounded. Therefore there
exists an increasing sequence $\{n_k\}_{k \in \mathbb{N}}$ such that

$$\lim_{k \to \infty} w^{(n_k)} = w^*.$$

Recall that $F(w) = f(w) + \lambda \sum_{i=1}^{p} \zeta(|w_i|)$ is continuous. We have

$$\lim_{k \to \infty} F(w^{(n_k)}) = F(w^*).$$

On the other hand, we know that $A^{(k)} \in \partial F(w^{(k)})$ and $B^{(k)} \in \partial F(w^{(k)})$. Moreover, from
Lemma 14 and Lemma 16 it can be seen that as $k \to \infty$, $A^{(k)} \to 0$ and $B^{(k)} \to 0$. Recall
that $\partial F$ is closed. So $0 \in \partial F(w^*)$, which implies that $w^*$ is a critical point of $F$. ■


A.7 Uniformized KL property

Before providing the global convergence result, we first introduce a class of concave and
continuous functions. Let $\eta \in (0, +\infty]$. We denote by $\Phi_\eta$ the class
of all concave and continuous functions $\varphi : [0, \eta) \to \mathbb{R}_+$ satisfying the following properties:

(a) $\varphi(0) = 0$ and $\varphi$ is continuous at 0;

(b) $\varphi$ is $C^1$ on $(0, \eta)$;

(c) $\varphi'(t) > 0$, $\forall t \in (0, \eta)$.

Lemma 22 (Bolte et al. (2013)) Suppose $\Omega$ is a compact set and let $F : \mathbb{R}^p \to (-\infty, \infty]$
be a lower semi-continuous function. Moreover, $F$ is constant on $\Omega$ and satisfies the KL property
at each point of $\Omega$. Then there exist $\epsilon > 0$, $\eta > 0$ and $\varphi \in \Phi_\eta$ such that for all $\bar{u}$ in $\Omega$, and
all $u$ in

$$\{u \in \mathbb{R}^p : \mathrm{dist}(u, \Omega) < \epsilon\} \cap \{u : F(\bar{u}) < F(u) < F(\bar{u}) + \eta\}$$

one has

$$\varphi'(F(u) - F(\bar{u})) \, \mathrm{dist}(0, \partial F(u)) \ge 1. \tag{30}$$


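A.8 The proof of Theorem 18

Proof As is known, there exists an increasing sequence $\{n_k\}_{k \in \mathbb{N}}$ such that $\{w^{(n_k)}\}$ converges
to $w^*$. Suppose that there exists an integer $n_0$ such that $F(w^{(n_0)}) = F(w^*)$. Then it is
clear that for any integer $N > n_0$, $F(w^{(N)}) = F(w^*)$ holds, and the convergence of the
sequence is trivial. Otherwise, we consider the case that $F(w^{(k)}) > F(w^*)$, $\forall k \in \mathbb{N}$.
Because the sequence $\{F(w^{(k)})\}_{k \in \mathbb{N}}$ is convergent, it is clear that for any $\eta > 0$, there exists
an integer $m$ such that $F(w^{(k)}) < F(w^*) + \eta$ for all $k > m$. By using Lemma 17, we have
$\lim_{k \to \infty} \mathrm{dist}(w^{(k)}, \mathcal{M}(w^{(0)})) = 0$, which implies that for any $\epsilon > 0$ there exists a positive
integer $n$ such that $\mathrm{dist}(w^{(k)}, \mathcal{M}(w^{(0)})) < \epsilon$ for all $k > n$. Let $l = \max\{m, n\}$. Then for
any $k > l$, we have

$$\varphi'\big(F(w^{(k)}) - F(w^*)\big) \, \mathrm{dist}(0, \partial F(w^{(k)})) \ge 1.$$

By Lemma 14 and Lemma 16, we have

$$\varphi'\big(F(w^{(k)}) - F(w^*)\big) \ge \frac{1}{L_{Q_f} + L_f + L_\zeta} \|w^{(k)} - w^{(k-1)}\|^{-1}. \tag{31}$$

We let $d_{k,k+1} = \varphi(F(w^{(k)}) - F(w^*)) - \varphi(F(w^{(k+1)}) - F(w^*))$. With the property of
concave functions, we have

$$d_{k,k+1} \ge \varphi'\big(F(w^{(k)}) - F(w^*)\big)\big(F(w^{(k)}) - F(w^{(k+1)})\big)$$

$$\ge \frac{\gamma}{2} \varphi'\big(F(w^{(k)}) - F(w^*)\big) \|w^{(k+1)} - w^{(k)}\|^2$$

$$\ge \frac{\gamma}{2 (L_f + L_{Q_f} + L_\zeta)} \|w^{(k)} - w^{(k-1)}\|^{-1} \|w^{(k+1)} - w^{(k)}\|^2.$$

That is,

$$\|w^{(k+1)} - w^{(k)}\|^2 \le M d_{k,k+1} \|w^{(k)} - w^{(k-1)}\|,$$

where $M = \frac{2 (L_f + L_{Q_f} + L_\zeta)}{\gamma}$. Notice that

$$M d_{k,k+1} \|w^{(k)} - w^{(k-1)}\| \le \Big( \frac{M d_{k,k+1} + \|w^{(k)} - w^{(k-1)}\|}{2} \Big)^2.$$

So we have

$$2 \|w^{(k+1)} - w^{(k)}\| \le M d_{k,k+1} + \|w^{(k)} - w^{(k-1)}\|. \tag{32}$$

Then, summing (32) over $k \ge l + 1$,

$$2 \sum_{k=l+1}^{\infty} \|w^{(k+1)} - w^{(k)}\| \le M \sum_{k=l+1}^{\infty} d_{k,k+1} + \sum_{k=l+1}^{\infty} \|w^{(k)} - w^{(k-1)}\|.$$

Since the last sum equals $\|w^{(l+1)} - w^{(l)}\| + \sum_{k=l+1}^{\infty} \|w^{(k+1)} - w^{(k)}\|$ and the sum of the
$d_{k,k+1}$ telescopes, we obtain

$$\sum_{k=l+1}^{\infty} \|w^{(k+1)} - w^{(k)}\| \le M d_{l+1,\infty} + \|w^{(l+1)} - w^{(l)}\|
\le M \varphi\big(F(w^{(l+1)}) - F(w^*)\big) + \|w^{(l+1)} - w^{(l)}\|.$$

Let $l \to \infty$. Since $\lim_{l \to \infty} \|w^{(l+1)} - w^{(l)}\| = 0$ and $\lim_{l \to \infty} F(w^{(l+1)}) = F(w^*)$, it is clear
that

$$\lim_{l \to \infty} \sum_{k=l+1}^{\infty} \|w^{(k+1)} - w^{(k)}\| = 0.$$

So

$$\sum_{k=0}^{\infty} \|w^{(k+1)} - w^{(k)}\| < \infty.$$

Then, we have

$$\lim_{m \to \infty} \sum_{k=m}^{l} \|w^{(k+1)} - w^{(k)}\| = 0,$$

for any $m < l$. This shows that $\{w^{(k)}\}_{k \in \mathbb{N}}$ is a Cauchy sequence. As a result, it is a
convergent sequence that converges to $w^*$. ■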


A.9 The proof of Theorem 19

This is a classical result for KL functions, since the corresponding function is

$$\varphi(t) = c \, t^{1 - \theta}, \quad \theta \in [0, 1).$$

Following Attouch and Bolte (2009), the conclusions of Theorem 19 hold.


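A.10 The proof of Theorem 21
Proof Recall that

$$w^{(k+1)} = \operatorname*{argmin}_{w} \{ \delta_C(w) + u(w) - \nabla v(w^{(k)})^T w \}.$$

We immediately have

$$0 = g^{(k+1)} + \nabla u(w^{(k+1)}) - \nabla v(w^{(k)}),$$

where $g^{(k+1)} \in \partial \delta_C(w^{(k+1)})$.

Because $\delta_C(w) + u(w) - \nabla v(w^{(k)})^T w$ is $\gamma$-strongly convex, we have

$$\delta_C(w^{(k)}) + u(w^{(k)}) - \nabla v(w^{(k)})^T w^{(k)} - \big[ \delta_C(w^{(k+1)}) + u(w^{(k+1)}) - \nabla v(w^{(k)})^T w^{(k+1)} \big]
\ge \langle 0, w^{(k)} - w^{(k+1)} \rangle + \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2. \tag{33}$$

By the convexity of $v(w)$, we obtain

$$v(w^{(k+1)}) \ge v(w^{(k)}) + \nabla v(w^{(k)})^T (w^{(k+1)} - w^{(k)}).$$

Thus,

$$\underbrace{\delta_C(w^{(k)}) + u(w^{(k)}) - v(w^{(k)})}_{F(w^{(k)})} - \underbrace{\big[ \delta_C(w^{(k+1)}) + u(w^{(k+1)}) - v(w^{(k+1)}) \big]}_{F(w^{(k+1)})}
\ge \frac{\gamma}{2} \|w^{(k+1)} - w^{(k)}\|^2. \tag{34}$$

On the other hand, we know that

$$0 = g^{(k+1)} + \nabla u(w^{(k+1)}) - \nabla v(w^{(k+1)}) + \nabla v(w^{(k+1)}) - \nabla v(w^{(k)}).$$

Let us denote $\nabla v(w^{(k)}) - \nabla v(w^{(k+1)})$ as $C^{(k+1)}$. Then $C^{(k+1)} \in \partial F(w^{(k+1)})$ and

$$\|C^{(k+1)}\| = \|\nabla v(w^{(k)}) - \nabla v(w^{(k+1)})\| \le L_v \|w^{(k+1)} - w^{(k)}\|. \tag{35}$$

Next we prove that $\mathcal{M}(w^{(0)})$ is a subset of $\mathrm{crit}\, F(w)$. Because of the coerciveness
of the function $F(w)$, the sequence is bounded, so there exists a subsequence $\{w^{(n_k)}\}_{k \in \mathbb{N}}$ which satisfies
$\lim_{k \to \infty} w^{(n_k)} = \bar{w}$. Since $\delta_C(w)$ is lower semicontinuous, we have

$$\liminf_{k \to \infty} \delta_C(w^{(n_k)}) \ge \delta_C(\bar{w}). \tag{36}$$

On the other hand,

$$\delta_C(w^{(k+1)}) + u(w^{(k+1)}) - \nabla v(w^{(k)})^T w^{(k+1)} \le \delta_C(\bar{w}) + u(\bar{w}) - \nabla v(w^{(k)})^T \bar{w}.$$

Rewriting the above inequality, we obtain

$$\delta_C(w^{(k+1)}) \le \delta_C(\bar{w}) - \big( u(w^{(k+1)}) - u(\bar{w}) \big) + \nabla v(w^{(k)})^T (w^{(k+1)} - \bar{w}).$$

Substitute $k$ with $n_k - 1$. By the fact that $u, v$ are $C^1$ functions, we have

$$\limsup_{k \to \infty} \delta_C(w^{(n_k)}) \le \delta_C(\bar{w}). \tag{37}$$

Combining (36) and (37), we immediately have

$$\lim_{k \to \infty} \delta_C(w^{(n_k)}) = \delta_C(\bar{w}).$$

Since $u(w)$ and $v(w)$ are continuous, we have

$$\lim_{k \to \infty} F(w^{(n_k)}) = F(\bar{w}).$$

(35) implies that $C^{(k)} \to 0$ as $k \to \infty$. Moreover, $C^{(k)} \in \partial F(w^{(k)})$. By the closedness
of $\partial F(w)$, we have $0 \in \partial F(\bar{w})$. So $\mathcal{M}(w^{(0)})$ is a subset of $\mathrm{crit}\, F(w)$. With (34)
and (35) ready, the rest of the proof is the same as that of Theorem 18. ■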